Substance abuse and drug addiction are dangerous problems that harm both the human body and society. Identifying populations at high risk of drug abuse can help establish medical interventions and prevent crime. We use the 2019 National Survey on Drug Use and Health (NSDUH) from the United States, a self-report survey on the use of illicit drugs, alcohol, and tobacco, to build a binary classification model that predicts substance abuse for individuals.
The dataset contains questionnaire responses from 56,136 US civilians aged 12 and older, each described by 2,741 features. The survey questions cover the history and recency of use of the following drugs: alcohol, tobacco, marijuana, cocaine (including crack), hallucinogens, heroin, inhalants, pain relievers, tranquilizers, stimulants, and sedatives. Respondents were also asked for demographic information, including gender, race, age, educational level, employment status, income level, health care access and coverage, illegal activities, and arrest records. Finally, respondents were asked whether they had ever taken drugs that fall into Schedule I or II in the US, such as heroin, LSD, ecstasy, and cocaine, and how recently.
In the dataset, certain questions could be legitimately skipped, producing an NA value, or a code such as 98, 99, or 999 indicating that the participant was unwilling to answer or that the question was not applicable. Also, to record recency of use, some drug intake questions use an ordinal variable (for example, 1 to 5) rather than a simple yes/no categorical variable.
Since the original dataset consists of a large number of features (2,741), the first step in cleaning is to select the relevant ones. Features for the classification model are selected in two main aspects:
Firstly, demographic characteristics are considered as influencing factors for substance abuse, such as age, educational level, race, employment status, income level, and arrest records. These features profile a respondent's background, living environment, and social class, which may reflect how easily a respondent can access drugs and how susceptible they are to persuasion to use them. Secondly, use of drugs with addictive potential is also treated as essential, such as the recency of tobacco, alcohol, marijuana, and pain reliever use. Among the survey questions, dependence is mainly represented by the frequency of drug use within a period before the survey and by the time interval between the last dose and the completion of the questionnaire. Generally, the shorter the interval and the higher the frequency of use, the higher the dependence should be.
Our labels are based on drugs classified as Schedule I or II in the US, which have a high potential for abuse. The label draws on the recency of use of the following drugs: heroin, LSD, ecstasy, cocaine, methamphetamine, methadone, hydromorphone, peyote, oxycodone, fentanyl, amphetamine, and methylphenidate. If a respondent used any of these drugs during the past 12 months, the label for abused drug use is set to TRUE.
Due to the survey's skip logic, many columns have a high proportion of null values: a question may be inapplicable to a respondent because a related question in the same aspect was already answered earlier. Multiple questions in the same aspect effectively divide the respondents into groups. We therefore select the synthesized columns, which consolidate the sub-questions in the same aspect, to avoid processing the null values of skipped questions. The remaining null values are omissions for personal reasons of respondents; we drop these rows entirely. The following figure shows that the proportion of remaining missing values is low enough to discard.
knitr::include_graphics("images/The removed missing values of synthesized features.png")
Fig 1. The removed missing values of synthesized features
We recode the values as integers that represent the degree of drug dependence or a respondent's condition. The original numbers are categorical and do not represent the conditions linearly. For example, for CIGREC, option 1 indicates smoking within the past 30 days, option 2 more than 30 days ago but within 12 months, and option 3 more than 12 months ago but within 3 years. The time intervals do not increase linearly, so these options should be read not as time but as the degree of dependence on cigarettes. Besides, some columns have disordered codes, with skipped-answer values interleaved among the ordered options. After rearrangement, the magnitude of the number broadly represents the degree of the attribute. For instance, 0-30 in CIG30USE indicates days of cigarette use from few to many (after removing ambiguous options), reflecting cigarette dependence, and 1-4 in income indicates an increasing income level.
After rearrangement, we produced percent stacked bar charts to compare the contribution of each value to the total across the categories of each feature, and a pie chart for the label (figure 2).
knitr::include_graphics("images/age.png")
knitr::include_graphics("images/alcus30d.png")
knitr::include_graphics("images/cig30av.png")
knitr::include_graphics("images/cig30use.png")
knitr::include_graphics("images/mrdaypyr.png")
knitr::include_graphics("images/newrace2.png")
knitr::include_graphics("images/alcdays.png")
knitr::include_graphics("images/label_pie.png")
Fig 2. Distribution of various attributes
Principal Component Analysis (PCA) is one of the most popular dimensionality-reduction techniques for large data. The plot of the cumulative proportion of variance explained (figure 3) shows that at least 10 principal components are required to capture at least 80% of the variance. Even though there is some redundancy among our features, given the limited number of relevant features we decided not to give up more information for classification. Thus, dimensionality-reduction techniques like PCA were ultimately not applied.
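The cumulative-variance check behind figure 3 can be sketched with base R's `prcomp`. The matrix below is a random synthetic stand-in for our cleaned 24-feature matrix (`df.complete` without the label column), so the component count it produces is illustrative only.

```r
# PCA sketch on a synthetic stand-in for the cleaned 24-feature matrix
set.seed(5003)
X <- matrix(rnorm(500 * 24), nrow = 500)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the leading components
cpve <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components capturing at least 80% of the variance
n80 <- which(cpve >= 0.8)[1]
```

On real, correlated survey features the curve rises faster than on this random stand-in; for our data it still takes at least 10 components to pass 80%.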
knitr::include_graphics("images/pca.png")
knitr::include_graphics("images/cpve.png")
Fig 3. PCA Scores Plot and Cumulative Proportion of Variance Explained
To further explore the relationships between attributes in the dataset, especially attributes that influence the target label, a correlation matrix in the form of a heatmap was produced (figure 4). A correlation matrix conveys the correlation coefficients between variables visually, displaying positive and negative correlations as colored cells, with the hue corresponding to the value. We note that features from questions in the same aspect often have a deeper color, either red or blue, and that no feature has a strong correlation with the label.
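A heatmap like figure 4 can be built directly from the correlation matrix. The sketch below uses a small synthetic data frame in place of `df.complete`, and plain `ggplot2` tiles rather than whichever helper produced the saved figure.

```r
library(ggplot2)

# Small synthetic stand-in for the cleaned data frame
set.seed(5003)
d <- as.data.frame(matrix(rnorm(200 * 6), ncol = 6))

cm <- cor(d)  # pairwise correlation coefficients

# Long format: one row per (Var1, Var2, correlation) cell, column-major like as.vector()
long <- data.frame(
  Var1  = factor(rep(rownames(cm), times = ncol(cm)), levels = rownames(cm)),
  Var2  = factor(rep(colnames(cm), each  = nrow(cm)), levels = colnames(cm)),
  value = as.vector(cm)
)

ggplot(long, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", limits = c(-1, 1)) +
  labs(title = "Correlation Matrix", x = NULL, y = NULL)
```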
knitr::include_graphics("images/correlation_matrix.png")
Fig 4. Correlation Matrix
These charts and the pie chart of the label distribution (figure 2) reveal a hidden fact about the dataset: it is highly class-imbalanced, with a biased distribution of observations across the label classes. The TRUE label accounts for only 18.93% of examples, while the FALSE label accounts for 81.07%; there are far more observations without abused drug use. In a binary classification problem, class imbalance tends to give the model low predictive accuracy on the infrequent class [1].
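The imbalance itself is a one-liner to verify; the vector below is a synthetic stand-in for our logical `label` with the reported proportions.

```r
# Stand-in label vector with the reported 81.07% / 18.93% split
label <- c(rep(FALSE, 8107), rep(TRUE, 1893))

# Percentage of each class (FALSE first, TRUE second)
pct <- round(100 * prop.table(table(label)), 2)
```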
To mitigate the class imbalance, we consider the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE takes a random data point from the minority class, finds its K nearest neighbours that belong to the minority class, and randomly selects one of those neighbours; it then draws a line between the two points and randomly picks a point on that line, which becomes the new synthetic data point. Both the K value and the number of repetitions of this process can be customized. However, too many synthetic points may hurt performance on the positive class, so for this project we simply set dup_size to 1, doubling the minority-class points of the original dataset. Our models are trained on both datasets, and we compare their performance to see the effect of SMOTE.
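The interpolation step described above can be sketched by hand. This is illustrative only; the `smotefamily::SMOTE()` call used later handles neighbour search and repetition internally, and the points here are made up.

```r
# One SMOTE step for a single minority-class point (illustrative sketch)
set.seed(1)
x  <- c(2, 3)                                       # a minority-class point
nn <- rbind(c(2.5, 3.5), c(1.5, 2.0), c(3.0, 3.0))  # its K = 3 minority neighbours

neighbour <- nn[sample(nrow(nn), 1), ]  # randomly select one neighbour
gap       <- runif(1)                   # random position on the connecting segment
synthetic <- x + gap * (neighbour - x)  # the new synthetic data point
```

With dup_size = 1, this process runs once per minority point, doubling the minority class.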
knitr::include_graphics("images/beforeSMOTE.png")
knitr::include_graphics("images/AfterSMOTE.png")
Fig 5. Distribution of label before and after SMOTE
As shown in Figure 2, the ranges of the features vary widely. Some algorithms, such as KNN and SVM, require features with similar variances; otherwise a feature with a large variance may dominate the classification and the remaining features become ineffective. Each feature is therefore scaled to the range 0-1.
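The 0-1 scaling can be done with a simple min-max helper. This is a sketch: the columns below are illustrative stand-ins, and in practice the minimum and maximum should be taken from the training split only to avoid leakage into the test set.

```r
# Min-max scale each numeric column to [0, 1]
minmax <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Illustrative columns standing in for features with very different ranges
toy    <- data.frame(CIG30USE = c(0, 15, 30), income = c(1, 2, 4))
scaled <- as.data.frame(lapply(toy, minmax))
```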
load('NSDUH_2019.RData')
PUF <- PUF2019_100920
PUF.col <- colnames(PUF)
#PUF.col.lower <- tolower(PUF.col)
feature.names <- c("cigever","cigrec","CIG30USE","CIG30AV","alcever","alcrec",
"ALCUS30D","alcdays","ALCBNG30D","mjever","mjrec","mrdaypyr",
"pnrnmlif","pnrnmrec","PNRNM30FQ","booked","AGE2","irsex",
"NEWRACE2","eduhighcat","irwrkstat","ANYHLTI2","income","MJDAY30A")
feature <- PUF[, feature.names]
#feature.col.bool <- PUF.col.lower %in% feature.col
#feature <- PUF[, feature.col.bool]
relevant.label.names <- c("cocever","cocrec","herever","herrec","lsd","peyote",
"ecstmolly","lsdrec","irecstmorec","methamflag","methamyr",
"oxcopdapyu","fentpdapyu","hydmpdapyu","mtdnpdapyu",
"amphetapyu","methpdapyu" )
label.relevant <- PUF[, relevant.label.names]
#label.col.bool <- PUF.col.lower %in% label.col
#label.relevant <- PUF[, label.col.bool]
label <- apply(label.relevant, 1, function(x) {
(x[1]==1 & x[2]<9) | (x[3]==1 & x[4]<9) | (x[5]==1 & x[6]<9) |
(x[7]==1 & x[8]==1) | (x[9]==1 & x[10]<3) | x[11]==1 | x[12]==1 |
x[13]==1 | x[14]==1 | x[15]==1 | x[16]==1 | x[17]==1
})
# cigever means 'EVER SMOKED A CIGARETTE'
# 1 = Yes
# 0 = No
feature[['cigever']][which(feature[['cigever']] == 2)] <- 0
#levels(factor(feature[['cigever']]))
# cigrec means 'TIME SINCE LAST SMOKED CIGARETTES'
#levels(factor(feature[['cigrec']]))
# From 1-4, the time interval gradually increases
# 5 = NEVER USED CIGARETTES
# All logically assigned are considered invalid
feature[['cigrec']][which(feature[['cigrec']] == 8)] <- NA
feature[['cigrec']][which(feature[['cigrec']] == 9)] <- NA
feature[['cigrec']][which(feature[['cigrec']] == 11)] <- NA
feature[['cigrec']][which(feature[['cigrec']] == 14)] <- NA
feature[['cigrec']][which(feature[['cigrec']] == 19)] <- NA
feature[['cigrec']][which(feature[['cigrec']] == 29)] <- NA
feature[['cigrec']][which(feature[['cigrec']] == 91)] <- 5
#levels(factor(feature[['cigrec']]))
# here we show 'cigever' and 'cigrec' as examples,
# the processings are similar for the rest, which are not shown.
# CIG30AV means AVG # CIGS SMOKED PER DAY/ON DAY SMOKED IN LAST 30 DAYS
# From 1-7, the smoked per day increases
# 0 = NEVER USED CIGARETTES
# 93 = DID NOT USE CIGARETTES IN THE PAST 30 DAYS, so replace 93 as 0
# from 0-7, the higher value is, the more cigarettes smoked
#levels(factor(feature[['CIG30AV']]))
feature[['CIG30AV']][which(feature[['CIG30AV']] == 94)] <- NA
feature[['CIG30AV']][which(feature[['CIG30AV']] == 97)] <- NA
feature[['CIG30AV']][which(feature[['CIG30AV']] == 98)] <- NA
feature[['CIG30AV']][which(feature[['CIG30AV']] == 91)] <- 0
feature[['CIG30AV']][which(feature[['CIG30AV']] == 93)] <- 0
#levels(factor(feature[['CIG30AV']]))
# CIG30USE
# from 1-30, the day smoking
# 0 = NEVER USED CIGARETTES
# 93 = DID NOT USE CIGARETTES IN THE PAST 30 DAYS, so replace 93 as 0
# from 0-30, the higher value is, the more frequent responsors smoked
#levels(factor(feature[['CIG30USE']]))
feature[['CIG30USE']][which(feature[['CIG30USE']] == 94)] <- NA
feature[['CIG30USE']][which(feature[['CIG30USE']] == 97)] <- NA
feature[['CIG30USE']][which(feature[['CIG30USE']] == 98)] <- NA
feature[['CIG30USE']][which(feature[['CIG30USE']] == 91)] <- 0
feature[['CIG30USE']][which(feature[['CIG30USE']] == 93)] <- 0
#levels(factor(feature[['CIG30USE']]))
# alcever means EVER HAD DRINK OF ALCOHOLIC BEVERAGE
# 1 = Yes
# 0 = No
#levels(factor(feature[['alcever']]))
feature[['alcever']][which(feature[['alcever']] == 2)] <- 0
feature[['alcever']][which(feature[['alcever']] == 85)] <- NA
feature[['alcever']][which(feature[['alcever']] == 94)] <- NA
feature[['alcever']][which(feature[['alcever']] == 97)] <- NA
#levels(factor(feature[['alcever']]))
# alcrec means TIME SINCE LAST DRANK ALCOHOLIC BEVERAGE
# From 1-3, time interval gradually increases
# 4 = NEVER USED ALCOHOL
# All logically assigned is considered invalid
#levels(factor(feature[['alcrec']]))
feature[['alcrec']][which(feature[['alcrec']] == 91)] <- 4
feature[['alcrec']][which(feature[['alcrec']] == 8)] <- NA
feature[['alcrec']][which(feature[['alcrec']] == 9)] <- NA
feature[['alcrec']][which(feature[['alcrec']] == 11)] <- NA
feature[['alcrec']][which(feature[['alcrec']] == 85)] <- NA
feature[['alcrec']][which(feature[['alcrec']] == 97)] <- NA
feature[['alcrec']][which(feature[['alcrec']] == 98)] <- NA
#levels(factor(feature[['alcrec']]))
# ALCUS30D means USUAL # OF DRINKS PER DAY PAST 30 DAYS
# From 1-85, drinks per day gradually increases
# 0 = NEVER USED ALCOHOL
# 993 = DID NOT USE ALCOHOL IN THE PAST 30 DAYS, so replace 993 as 0
# All logically assigned is considered invalid
#levels(factor(feature[['ALCUS30D']]))
feature[['ALCUS30D']][which(feature[['ALCUS30D']] == 993)] <- 0
feature[['ALCUS30D']][which(feature[['ALCUS30D']] == 991)] <- 0
feature[['ALCUS30D']][which(feature[['ALCUS30D']] == 975)] <- NA
feature[['ALCUS30D']][which(feature[['ALCUS30D']] == 985)] <- NA
feature[['ALCUS30D']][which(feature[['ALCUS30D']] == 994)] <- NA
feature[['ALCUS30D']][which(feature[['ALCUS30D']] == 997)] <- NA
feature[['ALCUS30D']][which(feature[['ALCUS30D']] == 998)] <- NA
#levels(factor(feature[['ALCUS30D']]))
# alcdays
# the higher value is, the more days alcohol used
#levels(factor(feature[['alcdays']]))
feature[['alcdays']][which(feature[['alcdays']] == 91)] <- 0
feature[['alcdays']][which(feature[['alcdays']] == 93)] <- 0
feature[['alcdays']][which(feature[['alcdays']] == 85)] <- NA
feature[['alcdays']][which(feature[['alcdays']] == 94)] <- NA
feature[['alcdays']][which(feature[['alcdays']] == 97)] <- NA
feature[['alcdays']][which(feature[['alcdays']] == 98)] <- NA
#levels(factor(feature[['alcdays']]))
# ALCBNG30D means DAYS HAD FOUR/FIVE OR MORE DRINKS PAST 30 DYS
# From 0-30, days gradually increases
# 0 = NEVER USED ALCOHOL
# 93 = DID NOT USE ALCOHOL IN THE PAST 30 DAYS, so replace 93 as 0
# All logically assigned is considered invalid
#levels(factor(feature[['ALCBNG30D']]))
feature[['ALCBNG30D']][which(feature[['ALCBNG30D']] == 93)] <- 0
feature[['ALCBNG30D']][which(feature[['ALCBNG30D']] == 91)] <- 0
feature[['ALCBNG30D']][which(feature[['ALCBNG30D']] == 80)] <- NA
feature[['ALCBNG30D']][which(feature[['ALCBNG30D']] == 85)] <- NA
feature[['ALCBNG30D']][which(feature[['ALCBNG30D']] == 94)] <- NA
feature[['ALCBNG30D']][which(feature[['ALCBNG30D']] == 97)] <- NA
feature[['ALCBNG30D']][which(feature[['ALCBNG30D']] == 98)] <- NA
#levels(factor(feature[['ALCBNG30D']]))
# mjever means EVER USED MARIJUANA/HASHISH
# 1 = Yes
# 0 = No
#levels(factor(feature[['mjever']]))
feature[['mjever']][which(feature[['mjever']] == 2)] <- 0
feature[['mjever']][which(feature[['mjever']] == 94)] <- NA
feature[['mjever']][which(feature[['mjever']] == 97)] <- NA
#levels(factor(feature[['mjever']]))
# mjrec means TIME SINCE LAST USED MARIJUANA/HASHISH
# From 1-3, time interval gradually increases
# 4 = NEVER USED MARIJUANA
# All logically assigned is considered invalid
#levels(factor(feature[['mjrec']]))
feature[['mjrec']][which(feature[['mjrec']] == 91)] <- 4
feature[['mjrec']][which(feature[['mjrec']] == 8)] <- NA
feature[['mjrec']][which(feature[['mjrec']] == 9)] <- NA
feature[['mjrec']][which(feature[['mjrec']] == 11)] <- NA
feature[['mjrec']][which(feature[['mjrec']] == 97)] <- NA
feature[['mjrec']][which(feature[['mjrec']] == 98)] <- NA
#levels(factor(feature[['mjrec']]))
# mrdaypyr means DAYS USED MARIJUANA/HASHISH PAST 12 MONTHS
# From 1-366, days gradually increases
# 0 = NEVER USED MARIJUANA
# 993 = DID NOT USE MARIJUANA IN THE PAST 12 MONTHS, so replace 993 as 0
# All logically assigned is considered invalid
#levels(factor(feature[['mrdaypyr']]))
feature[['mrdaypyr']][which(feature[['mrdaypyr']] == 991)] <- 0
feature[['mrdaypyr']][which(feature[['mrdaypyr']] == 993)] <- 0
feature[['mrdaypyr']][which(feature[['mrdaypyr']] == 985)] <- NA
feature[['mrdaypyr']][which(feature[['mrdaypyr']] == 989)] <- NA
feature[['mrdaypyr']][which(feature[['mrdaypyr']] == 994)] <- NA
feature[['mrdaypyr']][which(feature[['mrdaypyr']] == 997)] <- NA
feature[['mrdaypyr']][which(feature[['mrdaypyr']] == 998)] <- NA
feature[['mrdaypyr']][which(feature[['mrdaypyr']] == 999)] <- NA
#levels(factor(feature[['mrdaypyr']]))
# pnrnmlif means EVER USED PAIN RELIEVER NOT DIRECTED BY DR
# 1 = Yes
# 2 = No
# 5 = Yes LOGICALLY ASSIGNED (from skip pattern), so replace 5 as 1
#levels(factor(feature[['pnrnmlif']]))
feature[['pnrnmlif']][which(feature[['pnrnmlif']] == 2)] <- 0
feature[['pnrnmlif']][which(feature[['pnrnmlif']] == 91)] <- 0
feature[['pnrnmlif']][which(feature[['pnrnmlif']] == 5)] <- 1
feature[['pnrnmlif']][which(feature[['pnrnmlif']] == 85)] <- NA
feature[['pnrnmlif']][which(feature[['pnrnmlif']] == 94)] <- NA
feature[['pnrnmlif']][which(feature[['pnrnmlif']] == 97)] <- NA
feature[['pnrnmlif']][which(feature[['pnrnmlif']] == 98)] <- NA
#levels(factor(feature[['pnrnmlif']]))
# pnrnmrec means RC - MOST RECENT PAIN RELIEVER MISUSE (RECODE)
# From 1-3, time interval gradually increases
# 4 = NEVER USED pain reliever
# All logically assigned is considered invalid
#levels(factor(feature[['pnrnmrec']]))
feature[['pnrnmrec']][which(feature[['pnrnmrec']] == 91)] <- 4
feature[['pnrnmrec']][which(feature[['pnrnmrec']] == 8)] <- NA
feature[['pnrnmrec']][which(feature[['pnrnmrec']] == 9)] <- NA
feature[['pnrnmrec']][which(feature[['pnrnmrec']] == 83)] <- NA
feature[['pnrnmrec']][which(feature[['pnrnmrec']] == 98)] <- NA
#levels(factor(feature[['pnrnmrec']]))
# PNRNM30FQ
#levels(factor(feature[['PNRNM30FQ']]))
feature[['PNRNM30FQ']][which(feature[['PNRNM30FQ']] == 83)] <- 0
feature[['PNRNM30FQ']][which(feature[['PNRNM30FQ']] == 91)] <- 0
feature[['PNRNM30FQ']][which(feature[['PNRNM30FQ']] == 93)] <- 0
feature[['PNRNM30FQ']][which(feature[['PNRNM30FQ']] == 85)] <- NA
feature[['PNRNM30FQ']][which(feature[['PNRNM30FQ']] == 94)] <- NA
feature[['PNRNM30FQ']][which(feature[['PNRNM30FQ']] == 97)] <- NA
feature[['PNRNM30FQ']][which(feature[['PNRNM30FQ']] == 98)] <- NA
#levels(factor(feature[['PNRNM30FQ']]))
# booked means EVER ARRESTED AND BOOKED FOR BREAKING THE LAW
# 1 = Yes
# 0 = No
# 3 = Yes LOGICALLY ASSIGNED (from skip pattern), so replace 3 as 1
#levels(factor(feature[['booked']]))
feature[['booked']][which(feature[['booked']] == 2)] <- 0
feature[['booked']][which(feature[['booked']] == 3)] <- 1
feature[['booked']][which(feature[['booked']] == 85)] <- NA
feature[['booked']][which(feature[['booked']] == 94)] <- NA
feature[['booked']][which(feature[['booked']] == 97)] <- NA
feature[['booked']][which(feature[['booked']] == 98)] <- NA
#levels(factor(feature[['booked']]))
# AGE2 means RECODE - FINAL EDITED AGE
# From 1-17, ages gradually increases
#levels(factor(feature[['AGE2']]))
# irsex means IMPUTATION REVISED GENDER
# 1 = Male
# 2 = Female
#levels(factor(feature[['irsex']]))
# NEWRACE2 means RC-RACE/HISPANICITY RECODE (7 LEVELS)
#levels(factor(feature[['NEWRACE2']]))
# eduhighcat means RC-EDUCATION CATEGORIES
#levels(factor(feature[['eduhighcat']]))
# irwrkstat means EMPLOYMENT STATUS - IMPUTATION REVISED
# 99 = 12-14 year olds, so replace 99 as 0
#levels(factor(feature[['irwrkstat']]))
feature[['irwrkstat']][which(feature[['irwrkstat']] == 99)] <- 0
#levels(factor(feature[['irwrkstat']]))
# ANYHLTI2 means COVERED BY ANY HEALTH INSURANCE - RECODE
# 1 = Yes
# 0 = No
#levels(factor(feature[['ANYHLTI2']]))
feature[['ANYHLTI2']][which(feature[['ANYHLTI2']] == 94)] <- NA
feature[['ANYHLTI2']][which(feature[['ANYHLTI2']] == 97)] <- NA
feature[['ANYHLTI2']][which(feature[['ANYHLTI2']] == 98)] <- NA
feature[['ANYHLTI2']][which(feature[['ANYHLTI2']] == 2)] <- 0
#levels(factor(feature[['ANYHLTI2']]))
# income means RC-TOTAL FAMILY INCOME RECODE
#levels(factor(feature[['income']]))
# MJDAY30A
#levels(factor(feature[['MJDAY30A']]))
feature[['MJDAY30A']][which(feature[['MJDAY30A']] == 85)] <- NA
feature[['MJDAY30A']][which(feature[['MJDAY30A']] == 94)] <- NA
feature[['MJDAY30A']][which(feature[['MJDAY30A']] == 97)] <- NA
feature[['MJDAY30A']][which(feature[['MJDAY30A']] == 98)] <- NA
feature[['MJDAY30A']][which(feature[['MJDAY30A']] == 91)] <- 0
feature[['MJDAY30A']][which(feature[['MJDAY30A']] == 93)] <- 0
#levels(factor(feature[['MJDAY30A']]))
df <- cbind(feature, label)
df.complete <- na.omit(df)
#table(df.complete[, 'label'])
# Remember to install these packages
library(smotefamily)
library(ggplot2)
library(caret)  # for createDataPartition
set.seed(5003)
index <- createDataPartition(df.complete$label, p = 0.7, list = FALSE)
# separate the label column from the data
train_ <- df.complete[index, ]
test_ <- df.complete[-index, ]
train_y <- train_[["label"]]
train_x <- train_[,-25]
# Apply SMOTE; the K value can be tuned here
balanced.data <- SMOTE(train_x, train_y, K = 5, dup_size = 1)
# SMOTE returns a list; extract the balanced data from it
train.smote <- balanced.data$data
# The returned data gains a column called 'class'; convert it to the label and drop it
train.smote$label <- as.factor(train.smote$class)
train.smote <- train.smote[, -25]
# Show the balanced dataset
ggplot(train.smote, aes(x=label)) +
geom_bar(width = 0.1, fill = "#E69F00")+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ggtitle("Fig8: Label distribution")
For evaluation, in addition to drawing the confusion matrix for each algorithm, we select common binary-classification metrics (Accuracy, Recall, ROC, AUC, and the P-value), all of which fit the objectives of this experiment. The main metric is Recall (sensitivity), the proportion of actual positive samples that are correctly predicted. Since the purpose of this project is to identify people with drug-addiction tendencies, the model should catch as many drug users as possible; and because the TRUE label is much rarer than FALSE yet more important, Recall best measures the proportion of missed cases. Accuracy, the proportion of all samples predicted correctly, is used alongside it to judge overall quality. The second set of indicators is ROC and AUC. The ROC curve is a visualization that makes the results more intuitive: its x-axis is FPR (1 - Specificity) = FP/(FP+TN) and its y-axis is TPR (Sensitivity) = TP/(TP+FN). AUC is the area under the ROC curve; the closer to 1, the better. Both reflect the overall quality of the model. Because a model on this dataset could classify everything as TRUE to gain a high recall, Recall and Accuracy must be considered together: the model should not sacrifice too much accuracy to improve recall, so ROC and AUC serve as a reference. Finally, the P-value is used to check whether the model suits this dataset: if the P-value is less than 0.05, the model's effect is significant and it can be used for this dataset.
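These definitions can be made concrete with confusion-matrix counts; the numbers below are illustrative, not results from our models.

```r
# Illustrative confusion-matrix counts (not from our models)
TP <- 400; FN <- 350; FP <- 200; TN <- 3000

recall      <- TP / (TP + FN)             # sensitivity / TPR: share of TRUE cases caught
specificity <- TN / (TN + FP)
fpr         <- 1 - specificity            # x-axis of the ROC curve
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
```

A degenerate model predicting everything TRUE would score recall = 1 but accuracy near the minority-class rate, which is why these metrics are read together.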
Logistic Regression is used in various fields, most notably medicine and the social sciences. Since our dataset is also medical-survey based, we would like to see how well Logistic Regression fits here. We apply basic Logistic Regression as well as Logistic Regression with Ridge and Lasso penalties, on both the original dataset and the SMOTE dataset, to see whether there is any improvement. We also tuned the regularization parameter using a 10-fold cross-validated grid search to find the value that produces the best results.
knitr::include_graphics("images/Sens_lr.png")
knitr::include_graphics("images/AUC_lr.png")
Fig 6. Sensitivity and AUC results before SMOTE
For the training dataset, the AUC and Sensitivity figures show the model performs very well. Moreover, for both the Ridge- and Lasso-penalized Logistic Regression, a higher regularization parameter yields a higher Sensitivity; for AUC, the Ridge penalty's value decreases as the regularization parameter grows, whereas the Lasso penalty's stays steady. Using the best tuning parameter and computing the confusion matrix on the test dataset, we see that both the basic and the regularized logistic regression models predict far too few TRUE labels, which again supports our expectation that an imbalanced dataset leads to poor performance. As a result, both models perform badly in terms of sensitivity (recall), at only 0.286: the classifier largely fails to assign data points to the TRUE class, and there is a high chance it would classify someone addicted to drugs into the non-addicted group. Thus, Logistic Regression trained on the original dataset is not usable. We next check how Logistic Regression performs on the SMOTE dataset.
knitr::include_graphics("images/CM_lr.png")
knitr::include_graphics("images/penaltyCM_lr.png")
Fig 7. Confusion Matrix for LRs before SMOTE
We then apply the same models to the dataset after SMOTE.
knitr::include_graphics("images/compare_lr.png")
knitr::include_graphics("images/compare_smote_sens.png")
Fig 8. Sensitivity and AUC results after SMOTE
From the figures we can see that recall gradually increases with the regularization parameter for both regularized logistic regressions, while for AUC the regularization parameter seems to affect only the Ridge-penalized model; these patterns are unchanged by SMOTE. On the test dataset, both the basic and the regularized logistic regression models show a significant increase in true positives with only a slight decrease in true negatives: applying SMOTE improves recall substantially without greatly hurting accuracy, yielding a sensitivity of 0.5188 and an accuracy of 0.8198. The sensitivity is nearly doubled. As for accuracy, the P-value (Acc > NIR) is still below 0.05, so the model is much improved and good enough for predicting the TRUE class. Since even the SMOTE dataset remains biased, performance could be further improved with more data related to this field.
knitr::include_graphics("images/cm_smote_lr.png")
knitr::include_graphics("images/penaltyCM_smote_lr.png")
Fig 9. Confusion Matrix for LRs after SMOTE
In our judgement, Logistic Regression is good enough for the dataset after applying SMOTE, but not for the original dataset; the classifier's performance indeed improved significantly after SMOTE. We also note that SMOTE affects accuracy: by setting dup_size to 1, we can still keep the P-value (Acc > NIR) of Logistic Regression below 0.05. We believe that collecting more related data would further improve the model's performance.
library(caret)
library(ggplot2)
library(tidyverse)
library(mlbench)
library(gridExtra)
library(grid)
library(MASS)
library(e1071)
library(car)
library(pROC)
library(MLmetrics)
set.seed(5003)
index <- createDataPartition(df.complete$label, p = 0.7, list = FALSE)
train_ <- df.complete[index, ]
test_ <- df.complete[-index, ]
dim(train_)
dim(test_)
train_$label<-as.factor(train_$label)
#train_$label<-lapply(train_$label, as.numeric)
typeof(train_$label)
test_$label<-as.factor(test_$label)
#test_$label<-lapply(test_$label, as.numeric)
typeof(test_$label)
levels(train_$label) <- make.names(levels(factor(train_$label)))
levels(test_$label) <- make.names(levels(factor(test_$label)))
set.seed(5003)
MySummary <- function(data, lev = NULL, model = NULL) {
  a1 <- defaultSummary(data, lev, model)
  b1 <- twoClassSummary(data, lev, model)
  c1 <- prSummary(data, lev, model)
  c(a1, b1, c1)
}
trControl <- trainControl(method = "repeatedcv", repeats = 5,
                          savePredictions = TRUE,
                          summaryFunction = MySummary, classProbs = TRUE)
train_y <- train_[["label"]]
train_x <- train_[,-25]
logisticReg <- caret::train(label~., data = train_, method = "glm", trControl=trControl, metric="ROC")
logisticReg$results
library(yardstick)
library(ggplot2)
logisticReg_train.pred <- predict(logisticReg, newdata = train_)
accuracy_logisticReg_train <- confusionMatrix(data = logisticReg_train.pred, reference = train_$label, positive="TRUE.")
accuracy_logisticReg_train
logisticReg_test.pred <- predict(logisticReg, newdata = test_)
accuracy_logisticReg_test <- confusionMatrix(data = logisticReg_test.pred, reference = test_$label, positive="TRUE.")
accuracy_logisticReg_test
Y<-data.frame(logisticReg_test.pred,test_$label)
cm <- conf_mat(Y,test_.label,logisticReg_test.pred)
g=autoplot(cm, type = "heatmap") +
scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")
g+ theme(legend.position = "right") + ggtitle("Confusion Matrix for Basic LR")
reg.logisticReg <- caret::train(label ~ ., data = train_, method = "glmnet",
                                trControl = trControl,
                                tuneGrid = expand.grid(alpha = c(0, 1),
                                                       lambda = seq(0.001, 0.1, by = 0.001)),
                                metric = "ROC")
X<-split(reg.logisticReg$results, reg.logisticReg$results$alpha)
# In glmnet, alpha = 1 is the Lasso penalty and alpha = 0 is Ridge
plot(x = (X$`1`)$lambda, y = (X$`1`)$Sens, col = "blue", type = "l",
     xlab = "Regularization Parameter", ylab = "Sensitivity",
     main = "Sensitivity against Regularization Parameter")
lines(x = (X$`0`)$lambda, y = (X$`0`)$Sens, col = "red", type = "l")
legend(x = "topleft", legend = c("Lasso", "Ridge"),
       col = c("blue", "red"), lty = 1, cex = 0.8)
plot(x = (X$`1`)$lambda, y = (X$`1`)$AUC, col = "blue", type = "l",
     xlab = "Regularization Parameter", ylab = "AUC",
     main = "ROC against Regularization Parameter")
lines(x = (X$`0`)$lambda, y = (X$`0`)$AUC, col = "red", type = "l")
legend(x = "bottomleft", legend = c("Lasso", "Ridge"),
       col = c("blue", "red"), lty = 1, cex = 0.8)
reg.logisticReg.pred <- predict(reg.logisticReg, newdata = train_)
accuracy_reglogisticReg_train <- confusionMatrix(data = reg.logisticReg.pred, reference = train_$label, positive="TRUE.")
accuracy_reglogisticReg_train
reg.logisticReg_test.pred <- predict(reg.logisticReg, newdata = test_)
accuracy_reglogisticReg_test <- confusionMatrix(data = reg.logisticReg_test.pred, reference = test_$label, positive="TRUE.")
accuracy_reglogisticReg_test
Y<-data.frame(reg.logisticReg_test.pred,test_$label)
cm <- conf_mat(Y,test_.label,reg.logisticReg_test.pred)
g=autoplot(cm, type = "heatmap") +
scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")
g+ theme(legend.position = "right") + ggtitle("Confusion Matrix for Penalty LR")
ggplot(train_, aes(x=label)) +
geom_bar(width = 0.1, fill = "#E69F00")+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ggtitle("Fig8: Label distribution")
library(smotefamily)
balanced.data <- SMOTE(train_x, train_y, K=5, dup_size = 1)
train.smote <- balanced.data$data # extract only the balanced dataset
train.smote$label <- as.factor(train.smote$class)
train.smote<- train.smote[,-25]
ggplot(train.smote, aes(x=label)) +
geom_bar(width = 0.1, fill = "#E69F00")+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ggtitle("Fig8: Label distribution")
logisticReg.smote <- caret::train(label~., data = train.smote, method = "glm", trControl=trControl, metric="ROC")
logisticReg_smote_train.pred <- predict(logisticReg.smote, newdata = train.smote)
accuracy_logisticReg_smote_train <- confusionMatrix(data = logisticReg_smote_train.pred, reference = train.smote$label, positive="TRUE.")
accuracy_logisticReg_smote_train
logisticReg_smote_test.pred <- predict(logisticReg.smote, newdata = test_)
test_$label <- as.factor(test_$label)
accuracy_logisticReg_smote_test <- confusionMatrix(data = logisticReg_smote_test.pred, reference = test_$label, positive="TRUE.")
accuracy_logisticReg_smote_test
Y<-data.frame(logisticReg_smote_test.pred,test_$label)
cm <- conf_mat(Y,test_.label,logisticReg_smote_test.pred)
g=autoplot(cm, type = "heatmap") +
scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")
g+ theme(legend.position = "right")+ ggtitle("Confusion Matrix for LR(SMOTE)")
reg.logisticReg.smote <- caret::train(label~., data = train.smote, method = "glmnet", trControl=trControl, tuneGrid = expand.grid(alpha = c(0,1),lambda = seq(0.001,0.1,by = 0.001)), metric="ROC"
)
plot(reg.logisticReg.smote)
Y<-split(reg.logisticReg.smote$results, reg.logisticReg.smote$results$alpha)
plot(x=(Y$`1`)$lambda,y=(Y$`1`)$Sens, col="blue", type="l", xlab="Regularization Parameter", main="Sensitivity against Regularization Parameter", ylab="Sensitivity")
lines(x=(Y$`0`)$lambda,y=(Y$`0`)$Sens,col="red",type="l")
lines(x=(X$`1`)$lambda,y=(X$`1`)$Sens, col="purple", type="l")
lines(x=(X$`0`)$lambda,y=(X$`0`)$Sens,col="pink",type="l")
legend(x="topleft",legend=c("Ridge-SMOTE", "Lasso-SMOTE", "Ridge","Lasso"),
col=c("blue", "red", "purple", "pink"), lty=1:1, cex=0.8)
plot(x=(Y$`1`)$lambda,y=(Y$`1`)$AUC, col="blue", type="l",xlab="Regularization Parameter",main="AUC against Regularization Parameter", ylab="AUC")
lines(x=(Y$`0`)$lambda,y=(Y$`0`)$AUC,col="red",type="l")
lines(x=(X$`1`)$lambda,y=(X$`1`)$AUC, col="purple", type="l")
lines(x=(X$`0`)$lambda,y=(X$`0`)$AUC,col="pink",type="l")
legend(x="bottomleft",legend=c("Ridge-SMOTE", "Lasso-SMOTE", "Ridge","Lasso"),
col=c("blue", "red","purple", "pink"), lty=1:1, cex=0.8)
reg.logisticReg_smote.pred <- predict(reg.logisticReg.smote, newdata = train.smote)
accuracy_reglogisticReg_smote_train <- confusionMatrix(data = reg.logisticReg_smote.pred, reference = train.smote$label, positive="TRUE.")
accuracy_reglogisticReg_smote_train
reg.logisticReg_smote_test.pred <- predict(reg.logisticReg.smote, newdata = test_)
accuracy_reglogisticReg_smote_test <- confusionMatrix(data = reg.logisticReg_smote_test.pred, reference = test_$label, positive="TRUE.")
accuracy_reglogisticReg_smote_test
Y<-data.frame(reg.logisticReg_smote_test.pred,test_$label)
cm <- conf_mat(Y,test_.label,reg.logisticReg_smote_test.pred)
g=autoplot(cm, type = "heatmap") +
scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")
g+ theme(legend.position = "right")+ ggtitle("Confusion Matrix for Penalty LR(SMOTE)")
The principle of the SVM classifier is to maximize the geometric margin between classes in the feature space, subject to classifying the training data correctly. Maximizing the margin is equivalent to minimizing the hinge loss. For a large, linearly separable data set, a linear SVM generally classifies well; compared with kernel SVMs it trains faster and is widely used. As the figures below show, for the linear SVM the smaller the cost parameter C, the higher the sensitivity, because a smaller C imposes a lower penalty on misclassification. Since our tuning metric is recall rather than accuracy, the goal is not the highest accuracy: we choose the parameters that raise recall as far as possible while keeping accuracy reasonably high. The figures also show that the class-weight parameter has little effect on the linear SVM, both before and after SMOTE.
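The soft-margin objective described above can be sketched in a few lines of base R. This is an illustrative toy computation, not the fitted model: `hinge_objective` is a hypothetical helper, and the data and parameter values are made up.

```r
# Soft-margin linear SVM objective: 0.5*||w||^2 + C * sum of hinge losses,
# with labels y in {-1, +1}. A smaller C penalizes misclassification less.
hinge_objective <- function(w, b, X, y, C = 1) {
  margins <- as.vector(y * (X %*% w + b))        # y_i * f(x_i)
  0.5 * sum(w^2) + C * sum(pmax(0, 1 - margins)) # hinge is zero once margin >= 1
}

X <- matrix(c(1, 2, -1, -2), ncol = 2)  # two toy observations
y <- c(1, -1)
hinge_objective(w = c(0.5, 0.5), b = 0, X, y)  # 2.25 on this toy data
```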
knitr::include_graphics("images/cv_linear_svm.jpg")
knitr::include_graphics("images/cv_linear_svm_smote.jpg")
Fig 10. Grid search to find best parameter for linear svm
knitr::include_graphics("images/cm_linear_svm.jpg")
knitr::include_graphics("images/cm_linear_svm_smote.jpg")
Fig 11. Confusion Matrix for linear svm
For the linear SVM trained on the original training data, the accuracy was 0.7284 and the recall 0.7418. Training on the SMOTE-balanced data produced no change in the linear SVM classifier: both accuracy and recall remained identical. However, because the model's accuracy is below the no-information rate, the model is essentially meaningless. A likely cause is that with a linear kernel, pushing recall as high as possible greatly increases the number of false positives, dragging accuracy down too far.
When using an RBF kernel, the SVM becomes a non-linear classifier: it implicitly maps the features into a higher-dimensional space and replaces inner products with the kernel function. This classifier takes longer to train than the linear one. The figures below show that, with or without SMOTE, the larger the sigma, the better the classification. On the SMOTE-balanced data, a smaller C gives a better model: once the class imbalance has been alleviated, the misclassification penalty no longer needs to be large, and a smaller penalty yields a better training result.
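The RBF kernel itself is a one-line function. The form below, k(x, z) = exp(-sigma * ||x - z||^2), matches the sigma parameterization used by kernlab's `rbfdot`; `rbf_kernel` is a hypothetical helper with toy inputs, shown only to illustrate how sigma controls locality.

```r
# Gaussian (RBF) kernel: exp(-sigma * squared Euclidean distance).
# Larger sigma makes the kernel more local (similarity decays faster).
rbf_kernel <- function(x, z, sigma = 1) exp(-sigma * sum((x - z)^2))

rbf_kernel(c(0, 0), c(1, 1), sigma = 0.5)  # exp(-1), about 0.368
rbf_kernel(c(0, 0), c(1, 1), sigma = 5)    # much smaller: points look dissimilar
```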
knitr::include_graphics("images/cv_rbf_svm.jpg")
knitr::include_graphics("images/cv_rbf_svm_smote.jpg")
Fig 12. Grid search to find best parameter for rbf svm
knitr::include_graphics("images/cm_rbf_svm.jpg")
knitr::include_graphics("images/cm_rbf_svm_smote.jpg")
Fig 13. Confusion Matrix for rbf svm
For the RBF SVM trained on the original training data, the accuracy was 0.8366 and the recall 0.2347; trained on the SMOTE-balanced data, the accuracy was 0.8328 and the recall 0.4149.
Therefore, although accuracy dropped slightly, it remains above the no-information rate, while sensitivity improved substantially: SMOTE has a strong positive effect on the RBF SVM classifier.
In conclusion, the linear SVM cannot be used as a usable model, and the recall of the RBF SVM remains relatively low; compared with linear models such as logistic regression, the SVMs perform worse.
atrain <- train.smote
atest <- test.set
atrain$label <- factor(train.smote$label, levels = c('TRUE', 'FALSE'), labels = c('T', 'F'))
atest$label <- factor(test.set$label, levels = c('TRUE', 'FALSE'), labels = c('T', 'F'))
str(atrain)
atrain
preProcValues <- preProcess(atrain, method = c('range'))
train.Transformed <- predict(preProcValues, atrain)
test.Transformed <- predict(preProcValues, atest)
train.Transformed
test.Transformed
ggplot(train.Transformed, aes(x=label)) +
geom_bar(width = 0.1, fill = "#E69F00")+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ggtitle("Label distribution after SMOTE and normalization")
summary(train.Transformed$label)
svmlineargrid.smote <- expand.grid(cost = c(0.25, 1),
weight = c(0.25,0.5))
svmlinearmodel.smote <- train(label ~ ., data = train.Transformed, method = "svmLinearWeights", trControl = trainControl(method = "repeatedcv", repeats = 1, classProbs = TRUE, summaryFunction = twoClassSummary),
metric = 'Sens', tuneGrid = svmlineargrid.smote
)
svmlinearmodel.smote
trellis.par.set(caretTheme())
plot(svmlinearmodel.smote, metric = "Sens")
svmlineartest.smote <- predict(svmlinearmodel.smote, newdata = test.Transformed)
caret::confusionMatrix(svmlineartest.smote, test.Transformed$label)
library(pROC)
pre_label <- ifelse(svmlineartest.smote == 'T', 1, 0)
test_label <- ifelse(test.Transformed$label == 'T', 1, 0)
roc_svm_smote <- roc(response = test_label, predictor = pre_label)
plot(roc_svm_smote, col="red", lwd=3, main="ROC curve for Linear SVM (SMOTE)")
# install.packages('yardstick')  # run once if not already installed
library(yardstick)
y <- data.frame(svmlineartest.smote, test.Transformed$label)
cm <- conf_mat(y, test.Transformed.label, svmlineartest.smote)
library(ggplot2)
g = autoplot(cm, type = 'heatmap') + scale_fill_gradient(low = '#D6EAF8', high = '#2E86C1')
g + theme(legend.position = 'right') + ggtitle('Confusion Matrix for Linear SVM after smote')
svmlineargrid <- expand.grid(cost = c(0.25, 1),
weight = c(0.25,0.5,1))
svmlinearmodel <- train(label ~ ., data = train.Transformed, method = "svmLinearWeights", trControl = trainControl(method = "repeatedcv", repeats = 2, classProbs = TRUE, summaryFunction = twoClassSummary),
metric = 'Sens', tuneGrid = svmlineargrid
)
svmlinearmodel
trellis.par.set(caretTheme())
plot(svmlinearmodel, metric = "Sens")
ggplot(svmlinearmodel)
svmlineartest <- predict(svmlinearmodel, newdata = test.Transformed)
caret::confusionMatrix(svmlineartest, test.Transformed$label)
library(pROC)
pre_label <- ifelse(svmlineartest == 'T', 1, 0)
test_label <- ifelse(test.Transformed$label == 'T', 1, 0)
roc_svm <- roc(response = test_label, predictor = pre_label)
roc_svm
plot(roc_svm, col="red", lwd=3, main="ROC curve for Linear SVM")
# install.packages('yardstick')  # run once if not already installed
library(yardstick)
y <- data.frame(svmlineartest, test.Transformed$label)
cm <- conf_mat(y, test.Transformed.label, svmlineartest)
library(ggplot2)
g = autoplot(cm, type = 'heatmap') + scale_fill_gradient(low = '#D6EAF8', high = '#2E86C1')
g + theme(legend.position = 'right') + ggtitle('Confusion Matrix for Linear SVM')
In this experiment, the KNN algorithm was used to build one of the classification models for the selected data set. KNN (k-nearest neighbours) is one of the simpler machine learning algorithms and a commonly used supervised method. Its working mechanism is: given a test sample, find the k training samples closest to it in the training set under some distance metric, then predict the sample's class from the labels of those k neighbours, taking the class that occurs most often among them as the result. The Euclidean distance is generally used to measure the distance or similarity between the sample to be classified and the training samples. In two-dimensional space, the distance between a point (x1, y1) and a point (x2, y2) can be expressed as:
\(d=\sqrt{\left(x_{2}-x_{1}\right)^{2}+\left(y_{2}-y_{1}\right)^{2}}\)
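In R this formula generalizes directly to vectors of any length; `euclidean` below is a hypothetical helper for illustration (knn3 computes distances internally).

```r
# Euclidean distance between two numeric vectors of any length,
# generalizing the two-dimensional formula above.
euclidean <- function(a, b) sqrt(sum((a - b)^2))

euclidean(c(1, 1), c(4, 5))  # sqrt(3^2 + 4^2) = 5
```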
Before running KNN we standardize the data, because our features have different magnitudes. When distances between samples are computed, a feature with a very large value range dominates the calculation and the influence of the other features becomes negligible, even though a feature with a small range may in fact be more important; without standardization the final model would therefore be worse. Standardization rescales the data into a small fixed interval, removing the unit and magnitude effects so that subsequent comparison and weighting can be carried out across features on an equal footing.
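The min-max scaling to [0, 1] that caret's `preProcess(method = "range")` applies can be sketched as below. `range_scale` is a hypothetical helper, not a caret function; the key point is that the test set must be scaled with the *training* minima and maxima (which is what `predict(preProcValues, ...)` does), so scaled test values can fall outside [0, 1].

```r
# Min-max scaling to [0, 1] using given bounds (defaults: the vector's own range).
range_scale <- function(x, lo = min(x), hi = max(x)) (x - lo) / (hi - lo)

x_train <- c(10, 20, 40)
x_test  <- c(15, 50)
range_scale(x_train)                                       # 0.0000 0.3333 1.0000
range_scale(x_test, lo = min(x_train), hi = max(x_train))  # test scaled with train bounds
```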
The models in this experiment are built with the knn3 function from the caret package; three models were established in total. For the first, after standardizing the data set, we grid-searched k from 5 to 51 for the value that maximizes recall, and then reported the confusion matrix, AUC value and ROC curve of model one at that parameter. The second model first applied SMOTE to balance the data set, then grid-searched the parameter without standardizing the data. The third model applied SMOTE first and then standardization, and then grid-searched for the model under the optimal parameter.
The value of k for model one is 45. Although the model's sensitivity of 0.683 and accuracy of 0.9214 appear high, the p-value is 1 and the no-information rate is also 0.9214, so this KNN model is entirely unsuitable for the data set. A possible reason is that the sample distribution is too irregular, and the KNN model copes poorly with such complex data. The AUC at this point is only 0.626.
The k value of model two is 51, with sensitivity 0.376 and accuracy 0.726. Although these values are lower than model one's, the p-value is far below 0.05, so the model is usable. This shows that SMOTE balancing of the data set worked, even though the KNN model itself still does not classify this data set very well. The AUC of the model in this case is 0.709.
The value of k for model three is 45. The figure shows that, for each model, the larger the value of k, the better the model performs. This also suggests the model's learning ability is weak: as k grows the prediction approaches the average level of the data, judging by the broad "face" of the neighbourhood rather than the specifics of an individual sample. Here the recall is 0.4375, the accuracy 0.7759, and the p-value is again far below 0.05, indicating that the model performs better while remaining usable. The AUC at this point is 0.726. The confusion matrix of the model is shown in the figure below.
knitr::include_graphics("images/knnPicture1.png")
knitr::include_graphics("images/knnPicture2.png")
Fig 14. Grid search to find best parameter for knn and Confusion Matrix for knn
The ROC curves of the three models are shown in the figure.
knitr::include_graphics("images/knnPicture3.png")
Fig 15. ROC curve of KNN
## KNN
## No SMOTE; with data normalization
### import the package
# Remember to install this package
library(smotefamily)
library(ggplot2)
library(caret)
library(e1071)
library(lattice)
library(kknn)
library(pROC)
library(yardstick)
library(dplyr)
library(tidyr)
### Split the data
set.seed(5003)
index <- createDataPartition(df.complete$label, p = 0.7, list = FALSE)
# separate the label column from the data
train_ <- df.complete[index, ]
test_ <- df.complete[-index, ]
train_y <- train_[["label"]]
train_x <- train_[,-25]
lab.num <- ifelse(test_$label == FALSE, 0, 1)# for plot ROC curve use
### Do normalization and standardization
preProcValues <- preProcess(train_, method = c('range'))
train.scale <- predict(preProcValues, train_)
test.scale <- predict(preProcValues, test_)
head(train.scale)
### Train the model
k_value_51 = seq(5, 51, by = 2)
recall.test.1 <- c()
set.seed(5003)
for (val in k_value_51){
knn.model <- knn3(label ~., data = train.scale, k = val)
knn.pre <- predict(knn.model, newdata = test.scale)
knn.pre.lab <- ifelse(knn.pre[,1]>knn.pre[,2],"FALSE","TRUE" )
#result <- ifelse(knn.pre.lab == test_$label, TRUE, FALSE)
#summary(result)
cfm <- confusionMatrix(data = factor(knn.pre.lab), reference = factor(test.scale$label), positive = "TRUE")
recall <- cfm$byClass[1]
recall.test.1 <- c(recall.test.1, recall)
}
### plot the k-sensitivity
plot(k_value_51, recall.test.1, type = "b", xlab = "K value", ylab = "Recall",main = "Recall of kNN before Smote")
### get the confusionmatrix
set.seed(5003)
knn.model1 <- knn3(label ~., data = train.scale, k = 45)
knn.pre1 <- predict(knn.model1, newdata = test.scale)
knn.pre.lab1 <- ifelse(knn.pre1[,1]>knn.pre1[,2],"FALSE","TRUE" )
#result1 <- ifelse(knn.pre.lab1 == test_$label, TRUE, FALSE)
#summary(result1)
cfm1 <- confusionMatrix(data = factor(knn.pre.lab1), reference = factor(test.scale$label), positive = "TRUE")
cfm1
### plot ROC
knn.pre.lab1.num <- ifelse(knn.pre.lab1 == FALSE, 0, 1)
knn_roc1 <- roc(lab.num,knn.pre.lab1.num)
plot(knn_roc1, print.auc=TRUE, print.thres=TRUE,main='ROC curve of knn before smote(k=45)')
###plot heatmap
V <- data.frame(as.character(lab.num),as.character(knn.pre.lab1.num))
cm <- conf_mat(V, as.character.lab.num., as.character.knn.pre.lab1.num. )
g = autoplot(cm, type = "heatmap") +
scale_fill_gradient(low = "#D6EAF8", high ="#2E86C1")
g + theme(legend.position = "right") + ggtitle("Confusion Matrix for KNN before Smote(k=45)")
## Do SMOTE first, without normalization
### do smote
# Apply SMOTE; the K value (number of neighbours) can be tuned here.
balanced.data <- SMOTE(train_x, train_y, K=5)
# SMOTE returns a list; extract the balanced data from it
train.smote <- balanced.data$data
train.smote$label <- as.factor(train.smote$class)
head(train.smote)
# SMOTE appends a new column called class; discard it.
train.smote<- train.smote[,-25]
#Show the balanced dataset.
ggplot(train.smote, aes(x=label)) +
geom_bar(width = 0.1, fill = "#E69F00")+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+ggtitle("Label distribution after SMOTE")
### train the model
k_value_51 = seq(5, 51, by = 2)
recall.test.sm <- c()
set.seed(5003)
for (val in k_value_51){
knn.model <- knn3(label ~., data = train.smote, k = val)
knn.pre <- predict(knn.model, newdata = test_)
knn.pre.lab <- ifelse(knn.pre[,1]>knn.pre[,2],"FALSE","TRUE" )
#result <- ifelse(knn.pre.lab == test.set$label, TRUE, FALSE)
#summary(result)
cfm <- confusionMatrix(data = factor(knn.pre.lab), reference = factor(test_$label), positive = "TRUE")
recall <- cfm$byClass[1]
recall.test.sm <- c(recall.test.sm, recall)
}
### plot the k-sensitivity
plot(k_value_51, recall.test.sm, type = "b", xlab = "K value", ylab = "Recall",main = "KNN after Smote (no normalization)")
### get the confusionmatrix
set.seed(5003)
knn.model2 <- knn3(label ~., data = train.smote, k = 47)
knn.pre2 <- predict(knn.model2, newdata = test_)
knn.pre.lab2 <- ifelse(knn.pre2[,1]>knn.pre2[,2],"FALSE","TRUE" )
result2 <- ifelse(knn.pre.lab2 == test_$label, TRUE, FALSE)
summary(result2)
confusionMatrix(data = factor(knn.pre.lab2), reference = factor(test_$label), positive = "TRUE")
### plot ROC
#plot the ROC
knn.pre.lab2.num <- ifelse(knn.pre.lab2 == FALSE, 0, 1)
knn_roc2 <- roc(lab.num,knn.pre.lab2.num)
plot(knn_roc2, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),grid.col=c("green", "red"), max.auc.polygon=TRUE,auc.polygon.col="skyblue", print.thres=TRUE,main='ROC curve of knn after smote')
## Use the SMOTE-balanced data set, then do normalization
### do normalization
preProcValues <- preProcess(train.smote, method = c('range'))
train.smote.scale <- predict(preProcValues, train.smote)
testS.smote.scale <- predict(preProcValues, test_)
head(train.smote.scale)
k_value_51 = seq(5, 51, by = 2)
recall.test.smsc <- c()
set.seed(5003)
for (val in k_value_51){
knn.model <- knn3(label ~., data = train.smote.scale, k = val)
knn.pre <- predict(knn.model, newdata = testS.smote.scale)
knn.pre.lab <- ifelse(knn.pre[,1]>knn.pre[,2],"FALSE","TRUE" )
cfm <- confusionMatrix(data = factor(knn.pre.lab), reference = factor(testS.smote.scale$label), positive = "TRUE")
recall <- cfm$byClass[1]
recall.test.smsc <- c(recall.test.smsc, recall)
}
### plot k-sensitivity
plot(k_value_51, recall.test.smsc, type = "b", xlab = "K value", ylab = "Recall",main = "Recall of KNN after Smote and normalization")
### get the best confusionmatrix
set.seed(5003)
knn.model3 <- knn3(label ~., data = train.smote.scale, k = 45)
knn.pre3 <- predict(knn.model3, newdata = testS.smote.scale)
knn.pre.lab3 <- ifelse(knn.pre3[,1]>knn.pre3[,2],"FALSE","TRUE" )
#result3 <- ifelse(knn.pre.lab3 == testS.smote.scale$label, TRUE, FALSE)
#summary(result3)
confusionMatrix(data = factor(knn.pre.lab3), reference = factor(testS.smote.scale$label), positive = "TRUE")
### plot ROC
#plot the ROC
knn.pre.lab3.num <- ifelse(knn.pre.lab3 == FALSE, 0, 1)
knn_roc3 <- roc(lab.num,knn.pre.lab3.num)
plot(knn_roc3, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),grid.col=c("green", "red"), max.auc.polygon=TRUE,auc.polygon.col="skyblue", print.thres=TRUE,main='ROC curve of knn after smote and scaling')
V <- data.frame(as.character(lab.num),as.character(knn.pre.lab3.num))
cm <- conf_mat(V, as.character.lab.num., as.character.knn.pre.lab3.num. )
g = autoplot(cm, type = "heatmap") +
scale_fill_gradient(low = "#D6EAF8", high ="#2E86C1")
g + theme(legend.position = "right") + ggtitle("Confusion Matrix for KNN after Smote(k=45)")
## Plot the graph
result <- data.frame(cbind(recall.test.1, recall.test.sm, recall.test.smsc))
plot(k_value_51, result$recall.test.1, type = "l", xlab = "K value", ylab = "Recall",main = "KNN",col = "black")
lines(k_value_51, result$recall.test.sm, col = "blue")
lines(k_value_51, result$recall.test.smsc, col = "red")
legend("bottomright", legend=c("Before Smote", "After Smote(not normalization)", "After Smote"), col=c("black","blue","red"), lty=1,lwd=2)
plot(knn_roc1, main='ROC curve of knn')
par(new=TRUE)
plot(knn_roc2, col = "blue", xaxt="n", yaxt="n", ann=F, bty="n")
par(new=TRUE)
plot(knn_roc3, print.auc=TRUE, print.thres=TRUE,col = "red", xaxt="n", yaxt="n", ann=F, bty="n")
legend("bottomright", legend=c("Before Smote", "After Smote(not normalization)", "After Smote"), col=c("black","blue","red"), lty=1,lwd=2)
A decision tree is a supervised machine learning algorithm that can be visualized as an upside-down tree-like diagram. A classification tree consists of one or, more usually, many decision nodes, each posing a binary true/false question; the answer leads to another node until either the true or the false subset contains only pure labels. For this project, training data both before and after SMOTE processing were used to configure the decision tree. As shown in figures 16 and 17, a grid search over tree depth determined that a maximum depth of 3 gave the highest recall (sensitivity) on the original data, while for the SMOTE-processed data the highest recall was reached at depth 7. The decision tree model was therefore built with depth 3 for the original data set and depth 7 for the SMOTE-processed data.
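The criterion behind each true/false split can be illustrated with the Gini impurity, which rpart uses by default for classification; a split is chosen to reduce impurity, and a node with impurity 0 is pure, so splitting stops there. `gini` is a hypothetical helper for illustration only.

```r
# Gini impurity of a node: 1 - sum of squared class proportions.
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions in the node
  1 - sum(p^2)
}

gini(c(TRUE, TRUE, FALSE, FALSE))  # 0.5 -- maximally impure for two classes
gini(rep(TRUE, 4))                 # 0   -- pure node, no further splitting
```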
knitr::include_graphics("images/cv_dt.png")
knitr::include_graphics("images/cv_dt_smote.png")
Fig 16. Grid search to find best parameter for decision tree
knitr::include_graphics("images/cm_dt.png")
knitr::include_graphics("images/cm_dt_smote.png")
Fig 17. Confusion Matrix for decision Tree
For the decision tree trained on the original training data, the accuracy was 0.84 and the recall (sensitivity) 0.295; trained on the SMOTE-processed data, the accuracy was 0.83 and the recall (sensitivity) 0.45. SMOTE significantly increased the model's recall at a slightly lower accuracy, so the decision tree trained on SMOTE-processed data was retained.
Random forest is an ensemble learning method. As the name suggests, rather than a single decision tree, multiple trees are constructed to form a forest. Another key difference lies in the training data: rather than using the complete training set, each tree is trained on a random sample drawn from the original data, so each tree in the forest sees a different training set, although an observation may be used to train more than one tree. The forest's result is decided by majority vote: each tree outputs a prediction for a given test observation, and the most frequent prediction becomes the output of the random forest.
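The majority vote described above amounts to taking the modal class across trees. A minimal sketch, with hypothetical votes from a 5-tree forest for a single observation (`majority_vote` is an illustrative helper, not part of ranger):

```r
# Each tree casts one vote; the class with the most votes wins.
majority_vote <- function(votes) names(which.max(table(votes)))

majority_vote(c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE"))  # "TRUE"
```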
knitr::include_graphics("images/cv_rf.png")
knitr::include_graphics("images/cv_rf_smote.png")
Fig 18. Grid search to find best parameter for random forest
For the random forest, training data both before and after SMOTE processing were used to assemble the forest. Grid search over the number of trees, shown in figure 18, found that for both the original and the SMOTE-processed data the model had the highest recall (sensitivity) when using only one tree. The number of trees was therefore set to 1 when training the random forest on both data sets.
knitr::include_graphics("images/cm_rf.png")
knitr::include_graphics("images/cm_rf_smote.png")
Fig 19. Confusion Matrix for random forest
For the random forest trained on the original data set, the accuracy was 0.8411 and the recall (sensitivity) 0.339; trained on the SMOTE-processed data, the accuracy was 0.8034 and the recall (sensitivity) 0.35. Since the model trained on SMOTE-processed data had slightly higher recall, it was the one retained.
Boosting is another ensemble learning method that aims to improve on the prediction performance of a single decision tree. Unlike random forest, which samples subsets of the original training data, boosting trains each tree on a modified version of the training data. First, one tree is constructed from all the training data and then tested on it; observations that are poorly predicted are given a higher chance of being selected when constructing the next tree, and so on, until a set number of trees is reached. As in random forest, the prediction follows a majority-vote scheme.
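The reweighting idea can be sketched schematically as below. This is an AdaBoost-style illustration with made-up values; note that the gbm package used later actually implements gradient boosting, which fits each tree to the gradient of the loss rather than resampling observations.

```r
# AdaBoost-style weight update: misclassified observations are upweighted,
# correct ones downweighted, so the next tree focuses on previous mistakes.
update_weights <- function(w, correct, alpha = 0.5) {
  w <- ifelse(correct, w * exp(-alpha), w * exp(alpha))
  w / sum(w)  # renormalize to a probability distribution
}

w <- rep(0.25, 4)  # start with uniform weights over 4 observations
update_weights(w, correct = c(TRUE, TRUE, TRUE, FALSE))
# the 4th (misclassified) observation now carries the largest weight
```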
knitr::include_graphics("images/cv_boosting.png")
knitr::include_graphics("images/cv_boosting_smote.png")
Fig 20. Grid search to find best parameter for boosting
The boosting models were trained with the gbm package. Grid search over the parameters, shown in figure 20, found that boosting was not applicable to the original data set: the recall (sensitivity) was always 0, meaning the model simply predicted every sample as false to reach a high accuracy. On the SMOTE-processed data, the model returned the highest recall (sensitivity) with the number of trees set to 50. The final model was therefore trained only on the SMOTE-processed data with 50 trees.
knitr::include_graphics("images/cm_boosting_smote.png")
Fig 21. Confusion Matrix for boosting
For the boosting model trained on SMOTE-processed data, the accuracy was 0.83 and the recall (sensitivity) 0.45 — the highest accuracy and recall among the three tree-based models. Among the three, only the boosting model trained on SMOTE-processed data was therefore retained. Two points are worth discussing. First, recall fell as the number of trees in the random forest grew. One possible explanation is that a single feature was so dominant that it alone sufficed for prediction, so adding more trees only diluted it and lowered recall (sensitivity). Second, when tuning the boosting model on the original data, recall (sensitivity) was always 0, showing the model predicted every sample as false; a likely explanation is that, with the imbalanced original data, the algorithm maximized accuracy by predicting the majority class for every sample, which drives recall to 0.
set.seed(5003)
library(rpart)
library(rpart.plot)
depth = c(1:10)
recall.test <- c()
for (val in depth){
treeCon.over <- rpart.control(minsplit = 2, minbucket = 1, cp = 0, xval = 10, maxdepth = val)
fit <- rpart(label~., data = train_, method = 'class', control = treeCon.over)
predict_unseen <-predict(fit, test_, type = 'class')
cfm <- confusionMatrix(as.factor(predict_unseen),
as.factor(test_$label),
positive = "TRUE")
recall <- cfm$byClass[1]
recall.test <- c(recall.test, recall)
rpart.plot(fit, extra = 106)
}
plot(depth, recall.test, type = "b", xlab = "Depth of tree", ylab = "Recall", main = "Grid search to find best parameter")
##Using best depth for decision tree pre smote
set.seed(5003)
library(pROC)
#treeCon.over <- rpart.control(minsplit = 2, minbucket = 1, cp = 0, xval = 10, maxdepth = 3)
fit <- rpart(label~., data = train_, method = 'class')
predict_unseen <-predict(fit, test_, type = 'class')
cfm <- confusionMatrix(as.factor(predict_unseen),
as.factor(test_$label),
positive = "TRUE")
pre_label <- ifelse(predict_unseen == TRUE, 1, 0)
test_label <- ifelse(test_$label == TRUE, 1, 0)
roc_qda_decision_tree_pre_smote <- roc(response = test_label, predictor = pre_label)
plot(roc_qda_decision_tree_pre_smote, col="red", lwd=3, main="Sensitivity vs Specificity for decision tree")
AUC_1 <-auc(roc_qda_decision_tree_pre_smote)
cfm
AUC_1
##heatmap cm decision tree pre smote
library(ggplot2)
library(yardstick)
set.seed(5003)
Y <- data.frame(predict_unseen, test_$label)
cm <- conf_mat(Y, test_.label, predict_unseen)
g=autoplot(cm, type = "heatmap") + scale_fill_gradient (low="#D6EAF8",high = "#2E86C1")
g + theme (legend.position = "right") + ggtitle ("Confusion Matrix for Decision Tree")
##decision para tuning after smote
library(rpart)
library(rpart.plot)
set.seed(5003)
depth = c(2:10)
recall.test <- c()
for (val in depth){
treeCon.over <- rpart.control(minsplit = 2, minbucket = 1, cp = 0, xval = 10, maxdepth = val)
fit <- rpart(label~., data = train.smote, method = 'class', control = treeCon.over)
predict_unseen <-predict(fit, test_, type = 'class')
cfm <- confusionMatrix(as.factor(predict_unseen),
as.factor(test_$label),
positive = "TRUE")
recall <- cfm$byClass[1]
recall.test <- c(recall.test, recall)
rpart.plot(fit, extra = 106)
}
plot(depth, recall.test, type = "b", xlab = "Depth of tree", ylab = "Recall", main = "Grid search to find best parameter (SMOTE)")
##Using best depth for decision tree after smote
library(pROC)
set.seed(5003)
#treeCon.over <- rpart.control(minsplit = 2, minbucket = 1, cp = 0, xval = 10, maxdepth = 3)
fit <- rpart(label~., data = train.smote, method = 'class')
predict_unseen <-predict(fit, test_, type = 'class')
cfm <- confusionMatrix(as.factor(predict_unseen),
as.factor(test_$label),
positive = "TRUE")
pre_label <- ifelse(predict_unseen == TRUE, 1, 0)
test_label <- ifelse(test_$label == TRUE, 1, 0)
roc_qda_decision_tree_after_smote <- roc(response = test_label, predictor = pre_label)
plot(roc_qda_decision_tree_after_smote, col="red", lwd=3, main="Sensitivity vs Specificity for decision tree (Smote)")
AUC_2 <-auc(roc_qda_decision_tree_after_smote)
cfm
AUC_2
##heatmap cm decision tree after smote
library(ggplot2)
library(yardstick)
set.seed(5003)
Y <- data.frame(predict_unseen, test_$label)
cm <- conf_mat(Y, test_.label, predict_unseen)
g=autoplot(cm, type = "heatmap") + scale_fill_gradient (low="#D6EAF8",high = "#2E86C1")
g + theme (legend.position = "right") + ggtitle ("Confusion Matrix for Decision Tree (Smote)")
##random forest para tuning pre smote
library(ranger)
set.seed(5003)
recall.test <- c()
ntree.seq <- c(seq(from = 1, to = 100, by = 50))
for (i in ntree.seq) {
rf.model <- ranger(label ~., data = train_, num.trees = i)
rf.test <- predict(rf.model, data = test_)
label <- ifelse(rf.test$predictions == 1, TRUE, FALSE)
cfm <- confusionMatrix(as.factor(label),
as.factor(test_$label),
positive = "TRUE")
recall <- cfm$byClass[1]
recall.test <- c(recall.test, recall)
}
plot(ntree.seq, recall.test, type = "b", xlab = "Number of tree", ylab = "Recall", main = "Grid search to find best parameter")
##random forest with hyper para pre smote
set.seed(5003)
rf.model <- ranger(label ~., data = train_, num.trees = 1)
rf.test <- predict(rf.model, data = test_)
predict_unseen <- ifelse(rf.test$predictions == 1, TRUE, FALSE)
cfm <- confusionMatrix(as.factor(predict_unseen),
as.factor(test_$label),
positive = "TRUE")
pre_label <- ifelse(predict_unseen == TRUE, 1, 0)
test_label <- ifelse(test_$label == TRUE, 1, 0)
roc_qda_random_forest_pre_smote <- roc(response = test_label, predictor = pre_label)
plot(roc_qda_random_forest_pre_smote, col="red", lwd=3, main="Sensitivity vs Specificity for random forest")
AUC_3 <-auc(roc_qda_random_forest_pre_smote)
cfm
AUC_3
##heatmap cm random forest
library(ggplot2)
library(yardstick)
set.seed(5003)
Y <- data.frame(truth = as.factor(test_$label), estimate = as.factor(predict_unseen))
cm <- conf_mat(Y, truth, estimate)
g <- autoplot(cm, type = "heatmap") + scale_fill_gradient(low = "#D6EAF8", high = "#2E86C1")
g + theme(legend.position = "right") + ggtitle("Confusion Matrix for Random Forest")
##random forest para tuning after smote
library(ranger)
set.seed(5003)
recall.test <- c()
ntree.seq <- c(seq(50, 500, by = 50), seq(1000, 3000, by = 500))
for (i in ntree.seq) {
  rf.model <- ranger(label ~ ., data = train.smote, num.trees = i)
  rf.test <- predict(rf.model, data = test_)
  cfm <- confusionMatrix(as.factor(rf.test$predictions),
                         as.factor(test_$label),
                         positive = "TRUE")
  recall <- cfm$byClass[1]
  recall.test <- c(recall.test, recall)
}
plot(ntree.seq, recall.test, type = "b", xlab = "Number of trees", ylab = "Recall", main = "Grid search to find best parameter (SMOTE)")
##random forest with hyper para after smote
set.seed(5003)
rf.model <- ranger(label ~ ., data = train.smote, num.trees = 1)
rf.test <- predict(rf.model, data = test_)
predict_unseen <- rf.test$predictions
cfm <- confusionMatrix(as.factor(rf.test$predictions),
                       as.factor(test_$label),
                       positive = "TRUE")
pre_label <- ifelse(predict_unseen == TRUE, 1, 0)
test_label <- ifelse(test_$label == TRUE, 1, 0)
roc_qda_random_forest_post_smote <- roc(response = test_label, predictor = pre_label)
plot(roc_qda_random_forest_post_smote, col="red", lwd=3, main="Sensitivity vs Specificity for random forest (Smote)")
AUC_4 <-auc(roc_qda_random_forest_post_smote)
cfm
AUC_4
##heatmap cm random forest after smote
library(ggplot2)
library(yardstick)
set.seed(5003)
Y <- data.frame(truth = as.factor(test_$label), estimate = as.factor(predict_unseen))
cm <- conf_mat(Y, truth, estimate)
g <- autoplot(cm, type = "heatmap") + scale_fill_gradient(low = "#D6EAF8", high = "#2E86C1")
g + theme(legend.position = "right") + ggtitle("Confusion Matrix for Random Forest (Smote)")
##boosting pre smote
library(gbm)
set.seed(5003)
gbm.recall.test <- c()
ntree.seq <- seq(from = 50, to = 100, by = 50)
for (i in ntree.seq) {
  gbm.model <- gbm(label ~ ., data = train_,
                   distribution = 'gaussian', n.trees = i)
  gbm.test <- predict(gbm.model, newdata = test_,
                      type = 'response')
  # 1.5 is the midpoint of the numeric codes the gaussian fit
  # assigns to the two label levels (1 = FALSE, 2 = TRUE)
  gbm.label <- ifelse(gbm.test > 1.5, TRUE, FALSE)
  cfm <- confusionMatrix(as.factor(gbm.label),
                         as.factor(test_$label),
                         positive = "TRUE")
  gbm.recall.test <- c(gbm.recall.test, cfm$byClass[1])
}
plot(ntree.seq, gbm.recall.test, type = "b", xlab = "Number of trees", ylab = "Recall", main = "Grid search to find best parameter")
##gbm with hyper para pre smote
set.seed(5003)
gbm.model <- gbm(label ~ ., data = train_,
                 distribution = 'gaussian', n.trees = 50)
gbm.test <- predict(gbm.model, newdata = test_,
                    type = 'response')
gbm.label <- ifelse(gbm.test > 1.5, TRUE, FALSE)
predict_unseen <- gbm.label
cfm <- confusionMatrix(as.factor(gbm.label),
                       as.factor(test_$label),
                       positive = "TRUE")
summary(gbm.label)
pre_label <- ifelse(gbm.label == TRUE, 1, 0)
test_label <- ifelse(test_$label == TRUE, 1, 0)
roc_qda_boosting_pre_smote <- roc(response = test_label, predictor = pre_label)
plot(roc_qda_boosting_pre_smote, col="red", lwd=3, main="Sensitivity vs Specificity for gradient boosting with gbm")
AUC_5 <-auc(roc_qda_boosting_pre_smote)
cfm
AUC_5
##boosting after smote
library(gbm)
set.seed(5003)
gbm.recall.test <- c()
ntree.seq <- seq(from = 1, to = 1000, by = 100)
for (i in ntree.seq) {
  gbm.model <- gbm(label ~ ., data = train.smote,
                   distribution = 'gaussian', n.trees = i)
  gbm.test <- predict(gbm.model, newdata = test_,
                      type = 'response')
  gbm.label <- ifelse(gbm.test > 1.5, TRUE, FALSE)
  cfm <- confusionMatrix(as.factor(gbm.label),
                         as.factor(test_$label),
                         positive = "TRUE")
  gbm.recall.test <- c(gbm.recall.test, cfm$byClass[1])
}
plot(ntree.seq, gbm.recall.test, type = "b", xlab = "Number of trees", ylab = "Recall", main = "Grid search to find best parameter (SMOTE)")
##gbm with hyper para after smote
set.seed(5003)
gbm.model <- gbm(label ~ ., data = train.smote,
                 distribution = 'gaussian', n.trees = 1)
gbm.test <- predict(gbm.model, newdata = test_,
                    type = 'response')
gbm.label <- ifelse(gbm.test > 1.5, TRUE, FALSE)
cfm <- confusionMatrix(as.factor(gbm.label),
                       as.factor(test_$label),
                       positive = "TRUE")
summary(gbm.label)
pre_label <- ifelse(gbm.label == TRUE, 1, 0)
test_label <- ifelse(test_$label == TRUE, 1, 0)
roc_qda_boosting_post_smote <- roc(response = test_label, predictor = pre_label)
plot(roc_qda_boosting_post_smote, col="red", lwd=3, main="Sensitivity vs Specificity for gradient boosting (Smote) with gbm")
AUC_6 <-auc(roc_qda_boosting_post_smote)
cfm
AUC_6
##heatmap cm gbm after smote
library(ggplot2)
library(yardstick)
set.seed(5003)
Y <- data.frame(truth = as.factor(test_$label), estimate = as.factor(gbm.label))
cm <- conf_mat(Y, truth, estimate)
g <- autoplot(cm, type = "heatmap") + scale_fill_gradient(low = "#D6EAF8", high = "#2E86C1")
g + theme(legend.position = "right") + ggtitle("Confusion Matrix for boosting using gbm (Smote)")
##AUC plot
set.seed(5003)
roc_qda_decision_tree_pre_smote
roc_qda_decision_tree_after_smote
roc_qda_random_forest_pre_smote
roc_qda_random_forest_post_smote
roc_qda_boosting_pre_smote
roc_qda_boosting_post_smote
plot(roc_qda_decision_tree_pre_smote, col="red", lwd=3, main="Sensitivity vs Specificity")
lines(roc_qda_random_forest_pre_smote, col = "green")
lines(roc_qda_boosting_pre_smote, col = "blue")
legend("topright", legend =c("Decision tree, AUC = 0.6565", "Random forest, AUC = 0.6269", "Boosting, AUC = 0.5") , col=c("red", "green", "blue"), pch=1) # optional legend
plot(roc_qda_decision_tree_after_smote, col="red", lwd=3, main="Sensitivity vs Specificity (SMOTE)")
lines(roc_qda_random_forest_post_smote, col = "green")
lines(roc_qda_boosting_post_smote, col = "blue")
legend("topright", legend =c("Decision tree, AUC = 0.6563", "Random forest, AUC = 0.636", "Boosting, AUC = 0.6843") , col=c("red", "green", "blue"), pch=1) # optional legend
AUC_1
AUC_2
AUC_3
AUC_4
AUC_5
AUC_6
We considered four algorithms to train the classification model. The table below summarises the recall and AUC values of each model before and after SMOTE.
knitr::include_graphics("images/result.png")
Table 1. Summary of Results
Briefly, SVM had the highest recall of 0.74 after SMOTE, with logistic regression second at 0.5. Boosting showed the most significant improvement in recall after SMOTE. However, each model should also be assessed with a p-value testing whether its accuracy is significantly better than the No Information Rate. After requiring a significant p-value (< 0.05), logistic regression was judged the best model. For the AUC of the ROC curve, all four models achieved AUC values greater than 0.5 after SMOTE; again, logistic regression performed best, with an AUC of 0.8 after SMOTE.
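The comparison against the No Information Rate described above can be read directly from any of the caret `confusionMatrix` objects (`cfm`) computed earlier; a minimal sketch:

```r
# caret::confusionMatrix stores the NIR under the name "AccuracyNull",
# and the one-sided p-value for H0: accuracy <= NIR under "AccuracyPValue"
acc  <- cfm$overall["Accuracy"]
nir  <- cfm$overall["AccuracyNull"]
pval <- cfm$overall["AccuracyPValue"]
pval < 0.05  # TRUE when accuracy is significantly better than the NIR
```

This is the criterion used to rule out models whose high recall comes at the cost of accuracy no better than always predicting the majority class.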
Interestingly, we found that linear models performed better on this dataset, which is consistent with current classification studies on medical data, where logistic regression is one of the main statistical tools [2]. Our project likewise suggests that, when dealing with medical data, logistic regression may be the most suitable model.
This binary classification model used the NSDUH dataset to predict substance abuse in individuals. In summary, logistic regression achieved the best performance, and we conclude that our binary classification model can effectively accomplish our goal.
However, there are some limitations to consider for future work. Firstly, our features and labels were selected manually based on general information and may lack grounding in medical evidence; future studies should take the medical meaning of the dataset's attributes into greater account. Secondly, by applying SMOTE with dup_size set to 1, we improved the performance of almost all models, as stated above. However, setting dup_size to 2 or higher, i.e. generating more synthetic minority-class points, dragged down the accuracy and left the p-value no longer significant, which is not practical for model training. Future work could therefore collect more data in this field and test whether further improvement can be made.
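The dup_size experiment mentioned above can be sketched as follows. This is a hypothetical sketch assuming `train.smote` was built with `smotefamily::SMOTE`; the variable names mirror the earlier chunks, and the actual oversampling call used in this report may differ:

```r
library(smotefamily)

# SMOTE expects a numeric feature matrix and a separate target vector
features <- train_[, setdiff(names(train_), "label")]

# dup_size = 2 generates two synthetic points per minority-class point
# (dup_size = 1 was used for the results reported above)
sm <- SMOTE(X = features, target = train_$label, K = 5, dup_size = 2)

# smotefamily returns the oversampled data with the target in column "class"
train.smote2 <- sm$data
names(train.smote2)[names(train.smote2) == "class"] <- "label"
# Retrain any of the models above on train.smote2 and re-check
# accuracy and its p-value against the No Information Rate
```

Comparing the resulting accuracy p-values across dup_size values makes the trade-off between minority-class recall and overall accuracy explicit.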